{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "66d0270a-b74f-4110-901e-7960b00297af",
   "metadata": {},
   "source": [
    "# Astra DB\n",
    "\n",
    "This page provides a quickstart for using [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) as a vector store."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab8cd64f-3bb2-4f16-a0a9-12d7b1789bf6",
   "metadata": {},
   "source": [
    "> DataStax [Astra DB](https://docs.datastax.com/en/astra/home/astra.html) is a serverless vector-capable database built on Apache Cassandra® and made conveniently available through an easy-to-use JSON API."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2d6ca14-fb7e-4172-9aa0-a3119a064b96",
   "metadata": {},
   "source": [
    "_Note: in addition to access to the database, an OpenAI API Key is required to run the full example._"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb9be7ce-8c70-4d46-9f11-71c42a36e928",
   "metadata": {},
   "source": [
    "## Setup and general dependencies"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dbe7c156-0413-47e3-9237-4769c4248869",
   "metadata": {},
   "source": [
    "The integration requires the `langchain-astradb` Python package:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d00fcf4-9798-4289-9214-d9734690adfc",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -qU langchain-astradb"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2453d83a-bc8f-41e1-a692-befe4dd90156",
   "metadata": {},
   "source": [
    "_Install the additional packages required to run the whole demo:_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "56c1f86e-5921-4976-ac8f-1d62e5a512b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -qU langchain langchain-community langchain-openai datasets pypdf"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2910035-e61f-48d9-a110-d68c401b62aa",
   "metadata": {},
   "source": [
    "### Import dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b06619af-fea2-4863-8149-7f239a8c9c82",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "from astrapy.info import CollectionVectorServiceOptions\n",
    "from datasets import load_dataset\n",
    "from langchain_community.document_loaders import PyPDFLoader\n",
    "from langchain_core.documents import Document\n",
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "from langchain_core.runnables import RunnablePassthrough\n",
    "from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n",
    "from langchain_text_splitters import RecursiveCharacterTextSplitter"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22866f09-e10d-4f05-a24b-b9420129462e",
   "metadata": {},
   "source": [
    "## Import the Vector Store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0b32730d-176e-414c-9d91-fd3644c54211",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_astradb import AstraDBVectorStore"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68f61b01-3e09-47c1-9d67-5d6915c86626",
   "metadata": {},
   "source": [
    "## DB Connection parameters\n",
    "\n",
    "These are found on your Astra DB dashboard:\n",
    "\n",
    "- the API Endpoint looks like `https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com`\n",
    "- the Token looks like `AstraCS:6gBhNmsk135....`\n",
    "- you may optionally provide a _Namespace_ such as `my_namespace`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d78af8ed-cff9-4f14-aa5d-016f99ab547c",
   "metadata": {},
   "outputs": [],
   "source": [
    "ASTRA_DB_API_ENDPOINT = input(\"ASTRA_DB_API_ENDPOINT = \")\n",
    "ASTRA_DB_APPLICATION_TOKEN = getpass(\"ASTRA_DB_APPLICATION_TOKEN = \")\n",
    "\n",
    "desired_namespace = input(\"(optional) Namespace = \")\n",
    "# An empty answer falls back to None, i.e. no explicit namespace.\n",
    "ASTRA_DB_KEYSPACE = desired_namespace or None"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84a1fe85-a42c-4f15-92e1-f79f1dd43ea2",
   "metadata": {},
   "source": [
    "## Create the vector store\n",
    "\n",
    "There are two ways to create an Astra DB vector store, which differ in how the embeddings are computed.\n",
    "\n",
    "*Explicit embeddings*. You can separately instantiate a `langchain_core.embeddings.Embeddings` class and pass it to the `AstraDBVectorStore` constructor, just like with most other LangChain vector stores.\n",
    "\n",
    "*Integrated embedding computation*. Alternatively, you can use the [Vectorize](https://www.datastax.com/blog/simplifying-vector-embedding-generation-with-astra-vectorize) feature of Astra DB and simply specify the name of a supported embedding model when creating the store. The embedding computations are entirely handled within the database. (To proceed with this method, you must have enabled the desired embedding integration for your database, as described [in the docs](https://docs.datastax.com/en/astra-db-serverless/databases/embedding-generation.html).)\n",
    "\n",
    "**Please choose one method and run the corresponding cells only.**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c435386-e8d5-41f4-a9e5-7b609ef781f9",
   "metadata": {},
   "source": [
    "### Method 1: provide embeddings explicitly\n",
    "\n",
    "This demo will use an OpenAI embedding model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dfa5c005-9738-4c53-b8a8-8540fcbb8bad",
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ[\"OPENAI_API_KEY\"] = getpass(\"OPENAI_API_KEY = \")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3accae6f-73e2-483a-83f7-76eb33558a1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "my_embeddings = OpenAIEmbeddings()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "465b1b16-5363-4c4f-9917-a49e02a86c14",
   "metadata": {},
   "source": [
    "Now you can create the vector store:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b77553b-8bb5-4949-b87b-8c6abac56a26",
   "metadata": {},
   "outputs": [],
   "source": [
    "vstore = AstraDBVectorStore(\n",
    "    embedding=my_embeddings,\n",
    "    collection_name=\"astra_vector_demo\",\n",
    "    api_endpoint=ASTRA_DB_API_ENDPOINT,\n",
    "    token=ASTRA_DB_APPLICATION_TOKEN,\n",
    "    namespace=ASTRA_DB_KEYSPACE,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d5d2bfa-c071-4a5b-8b6e-3daa1b6de164",
   "metadata": {},
   "source": [
    "### Method 2: use Astra Vectorize (embeddings integrated in Astra DB)\n",
    "\n",
    "This method assumes that you have:\n",
    "\n",
    "- enabled the OpenAI integration in your Astra DB organization,\n",
    "- added an API Key named `\"MY_OPENAI_API_KEY\"` to the integration, and\n",
    "- scoped it to the database you are using.\n",
    "\n",
    "For more details please consult the [documentation](https://docs.datastax.com/en/astra-db-serverless/integrations/embedding-providers/openai.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d18455d-3fa6-4f9e-b687-3a2bc71c9a23",
   "metadata": {},
   "outputs": [],
   "source": [
    "openai_vectorize_options = CollectionVectorServiceOptions(\n",
    "    provider=\"openai\",\n",
    "    model_name=\"text-embedding-3-small\",\n",
    "    authentication={\n",
    "        \"providerKey\": \"MY_OPENAI_API_KEY\",\n",
    "    },\n",
    ")\n",
    "\n",
    "vstore = AstraDBVectorStore(\n",
    "    collection_name=\"astra_vectorize_demo\",\n",
    "    api_endpoint=ASTRA_DB_API_ENDPOINT,\n",
    "    token=ASTRA_DB_APPLICATION_TOKEN,\n",
    "    namespace=ASTRA_DB_KEYSPACE,\n",
    "    collection_vector_service_options=openai_vectorize_options,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a348678-b2f6-46ca-9a0d-2eb4cc6b66b1",
   "metadata": {},
   "source": [
    "## Load a dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "552e56b0-301a-4b06-99c7-57ba6faa966f",
   "metadata": {},
   "source": [
    "Convert each entry in the source dataset into a `Document`, then write them into the vector store:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a1f532f-ad63-4256-9730-a183841bd8e9",
   "metadata": {},
   "outputs": [],
   "source": [
    "philo_dataset = load_dataset(\"datastax/philosopher-quotes\")[\"train\"]\n",
    "\n",
    "docs = []\n",
    "for entry in philo_dataset:\n",
    "    metadata = {\"author\": entry[\"author\"]}\n",
    "    doc = Document(page_content=entry[\"quote\"], metadata=metadata)\n",
    "    docs.append(doc)\n",
    "\n",
    "inserted_ids = vstore.add_documents(docs)\n",
    "print(f\"\\nInserted {len(inserted_ids)} documents.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79d4f436-ef04-4288-8f79-97c9abb983ed",
   "metadata": {},
   "source": [
    "In the above, `metadata` dictionaries are created from the source data and are part of the `Document`.\n",
    "\n",
    "_Note: check the [Astra DB API Docs](https://docs.datastax.com/en/astra-serverless/docs/develop/dev-with-json.html#_json_api_limits) for the valid metadata field names: some characters are reserved and cannot be used._"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "084d8802-ab39-4262-9a87-42eafb746f92",
   "metadata": {},
   "source": [
    "Add some more entries, this time with `add_texts`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6b157f5-eb31-4907-a78e-2e2b06893936",
   "metadata": {},
   "outputs": [],
   "source": [
    "texts = [\"I think, therefore I am.\", \"To the things themselves!\"]\n",
    "metadatas = [{\"author\": \"descartes\"}, {\"author\": \"husserl\"}]\n",
    "ids = [\"desc_01\", \"huss_xy\"]\n",
    "\n",
    "inserted_ids_2 = vstore.add_texts(texts=texts, metadatas=metadatas, ids=ids)\n",
    "print(f\"\\nInserted {len(inserted_ids_2)} documents.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63840eb3-8b29-4017-bc2f-301bf5001f28",
   "metadata": {},
   "source": [
    "_Note: you can speed up `add_texts` and `add_documents` by increasing the concurrency level for_\n",
    "_these bulk operations. See the `*_concurrency` parameters in the class constructor and the `add_texts` docstring_\n",
    "_for details. The best-performing values depend on the network and on the client machine's specifications._"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c031760a-1fc5-4855-adf2-02ed52fe2181",
   "metadata": {},
   "source": [
    "## Run searches"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02a77d8e-1aae-4054-8805-01c77947c49f",
   "metadata": {},
   "source": [
    "The cells below run a plain similarity search, a search with metadata filtering, and a search that also returns the similarity scores:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1761806a-1afd-4491-867c-25a80d92b9fe",
   "metadata": {},
   "outputs": [],
   "source": [
    "results = vstore.similarity_search(\"Our life is what we make of it\", k=3)\n",
    "for res in results:\n",
    "    print(f\"* {res.page_content} [{res.metadata}]\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eebc4f7c-f61a-438e-b3c8-17e6888d8a0b",
   "metadata": {},
   "outputs": [],
   "source": [
    "results_filtered = vstore.similarity_search(\n",
    "    \"Our life is what we make of it\",\n",
    "    k=3,\n",
    "    filter={\"author\": \"plato\"},\n",
    ")\n",
    "for res in results_filtered:\n",
    "    print(f\"* {res.page_content} [{res.metadata}]\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11bbfe64-c0cd-40c6-866a-a5786538450e",
   "metadata": {},
   "outputs": [],
   "source": [
    "results = vstore.similarity_search_with_score(\"Our life is what we make of it\", k=3)\n",
    "for res, score in results:\n",
    "    print(f\"* [SIM={score:.3f}] {res.page_content} [{res.metadata}]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b14ea558-bfbe-41ce-807e-d70670060ada",
   "metadata": {},
   "source": [
    "### MMR (maximal marginal relevance) search\n",
    "\n",
    "_Note: the MMR search method is not (yet) supported for vector stores built with Astra Vectorize._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "76381ce8-780a-4e3b-97b1-056d6782d7d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "results = vstore.max_marginal_relevance_search(\n",
    "    \"Our life is what we make of it\",\n",
    "    k=3,\n",
    "    filter={\"author\": \"aristotle\"},\n",
    ")\n",
    "for res in results:\n",
    "    print(f\"* {res.page_content} [{res.metadata}]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60fda5df-14e4-4fb0-bd17-65a393fab8a9",
   "metadata": {},
   "source": [
    "### Async\n",
    "\n",
    "The Astra DB vector store supports all async methods (`asimilarity_search`, `afrom_texts`, `adelete` and so on) natively, that is, without wrapping the synchronous calls in threads."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1cc86edd-692b-4495-906c-ccfd13b03c23",
   "metadata": {},
   "source": [
    "## Deleting stored documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38a70ec4-b522-4d32-9ead-c642864fca37",
   "metadata": {},
   "outputs": [],
   "source": [
    "delete_1 = vstore.delete(inserted_ids[:3])\n",
    "print(f\"all_succeeded={delete_1}\")  # True, all documents deleted"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4cf49ed-9d29-4ed9-bdab-51a308c41b8e",
   "metadata": {},
   "outputs": [],
   "source": [
    "delete_2 = vstore.delete(inserted_ids[2:5])\n",
    "print(f\"some_succeeded={delete_2}\")  # True, though one of these IDs was already deleted"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "847181ba-77d1-4a17-b7f9-9e2c3d8efd13",
   "metadata": {},
   "source": [
    "## A minimal RAG chain"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd64b844-846f-43c5-a7dd-c26b9ed417d0",
   "metadata": {},
   "source": [
    "The next cells will implement a simple RAG pipeline:\n",
    "- download a sample PDF file and load it into the store;\n",
    "- create a RAG chain with LCEL (LangChain Expression Language), with the vector store at its heart;\n",
    "- run the question-answering chain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5cbc4dba-0d5e-4038-8fc5-de6cadd1c2a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "!curl -L \\\n",
    "    \"https://github.com/awesome-astra/datasets/blob/main/demo-resources/what-is-philosophy/what-is-philosophy.pdf?raw=true\" \\\n",
    "    -o \"what-is-philosophy.pdf\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "459385be-5e9c-47ff-ba53-2b7ae6166b09",
   "metadata": {},
   "outputs": [],
   "source": [
    "pdf_loader = PyPDFLoader(\"what-is-philosophy.pdf\")\n",
    "splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)\n",
    "docs_from_pdf = pdf_loader.load_and_split(text_splitter=splitter)\n",
    "\n",
    "print(f\"Documents from PDF: {len(docs_from_pdf)}.\")\n",
    "inserted_ids_from_pdf = vstore.add_documents(docs_from_pdf)\n",
    "print(f\"Inserted {len(inserted_ids_from_pdf)} documents.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5010a66c-4298-4e32-82b5-2da0d36a5c70",
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever = vstore.as_retriever(search_kwargs={\"k\": 3})\n",
    "\n",
    "philo_template = \"\"\"\n",
    "You are a philosopher who draws inspiration from great thinkers of the past\n",
    "to craft well-thought-out answers to user questions. Use the provided context as the basis\n",
    "for your answers and do not make up new reasoning paths - just mix-and-match what you are given.\n",
    "Your answers must be concise and to the point. Refrain from answering about topics other than philosophy.\n",
    "\n",
    "CONTEXT:\n",
    "{context}\n",
    "\n",
    "QUESTION: {question}\n",
    "\n",
    "YOUR ANSWER:\"\"\"\n",
    "\n",
    "philo_prompt = ChatPromptTemplate.from_template(philo_template)\n",
    "\n",
    "llm = ChatOpenAI()\n",
    "\n",
    "chain = (\n",
    "    {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
    "    | philo_prompt\n",
    "    | llm\n",
    "    | StrOutputParser()\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fcbc1296-6c7c-478b-b55b-533ba4e54ddb",
   "metadata": {},
   "outputs": [],
   "source": [
    "chain.invoke(\"How does Russell elaborate on Peirce's idea of the security blanket?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "869ab448-a029-4692-aefc-26b85513314d",
   "metadata": {},
   "source": [
    "For more, check out a complete RAG template using Astra DB [here](https://github.com/langchain-ai/langchain/tree/master/templates/rag-astradb)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "177610c7-50d0-4b7b-8634-b03338054c8e",
   "metadata": {},
   "source": [
    "## Cleanup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0da4d19f-9878-4d3d-82c9-09cafca20322",
   "metadata": {},
   "source": [
    "To delete the collection from your Astra DB instance entirely, run the cell below.\n",
    "\n",
    "_(You will lose the data you stored in it.)_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd405a13-6f71-46fa-87e6-167238e9c25e",
   "metadata": {},
   "outputs": [],
   "source": [
    "vstore.delete_collection()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
